Tech Insights | Digitalization & Automation Expert

How an Emoji Breaks Your Software

Have you ever tried to save something simple, like a winking face emoji, only to have your application throw an exception?

I tried to insert this winking face 😉 into a comment in Jira, and suddenly everything broke. It wasn’t just a minor glitch — the entire page crashed, and I had to refresh just to keep working. All that from a tiny emoji. Unbelievable!

It seems impossible that a single, tiny icon could crash a robust system, but after doing some research, I saw that this happens more often than I had thought.

This isn't just a random bug; it's a fascinating look at the evolution of character encoding and how database configurations can't keep up with modern emojis. Let's dig into it.

Understanding Character Encoding: Unicode and UTF-8

To understand the problem, we first need the basics of character encoding, and that starts with Unicode. Think of Unicode as a global dictionary: its purpose is to assign every character a unique number, called a code point. You could say that Unicode is the "what". Let us look at some examples:

Character   Unicode    UTF-8 (hex)    Bytes
A           U+0041     41             1
ü           U+00FC     C3 BC          2
木          U+6728     E6 9C A8       3
😉          U+1F609    F0 9F 98 89    4

While Unicode is the "what", UTF (Unicode Transformation Format) is the "how". It transforms each Unicode code point into a sequence of bytes that the computer can actually store and process. For example, our 'A' with the code point U+0041 becomes the hexadecimal value 41, which is one byte.
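To make the "what" and the "how" tangible, here is a minimal Python sketch that prints the code point, the UTF-8 bytes, and the byte count for a few characters. The helper name utf8_row is made up for this illustration; it only relies on Python's built-in ord() and str.encode().

```python
def utf8_row(ch: str) -> tuple[str, str, int]:
    """Return (code point, UTF-8 bytes as hex, byte count) for one character."""
    encoded = ch.encode("utf-8")                        # the "how": bytes on disk
    code_point = f"U+{ord(ch):04X}"                     # the "what": Unicode number
    hex_bytes = " ".join(f"{b:02X}" for b in encoded)
    return code_point, hex_bytes, len(encoded)


if __name__ == "__main__":
    # Sample characters: Latin letter, umlaut, CJK character, emoji
    for ch in ["A", "\u00fc", "\u6728", "\U0001F609"]:
        print(ch, *utf8_row(ch))
```

Running it on your own text is a quick way to see which characters need more than three bytes.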

UTF-8 is like an expandable suitcase: it grows from one to four bytes depending on which character you pack inside.

All characters that need at most three bytes in UTF-8 lie within the so-called Basic Multilingual Plane (BMP), which covers the code points U+0000 to U+FFFF. So how does our nice-looking winking face 😉 become such a big deal?

utf8mb3 vs. utf8mb4

Imagine a boxing match: in the left corner, the veteran utf8mb3, and in the right corner, the modern champion utf8mb4. At first glance they look the same, but when our winking emoji 😉 (code point U+1F609) steps into the ring, things get interesting. Since it belongs to the Supplementary Multilingual Plane (SMP), it requires four bytes to be represented in UTF-8.
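The plane a character lives in can be read directly from its code point: each plane spans 65,536 code points, so shifting the code point right by 16 bits yields the plane number. The following is a small sketch of that arithmetic; the function names are chosen for this example only.

```python
def unicode_plane(ch: str) -> int:
    """Return the Unicode plane of a single character (0 = BMP, 1 = SMP)."""
    return ord(ch) >> 16          # each plane holds 0x10000 code points


def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


if __name__ == "__main__":
    for ch in ["A", "\U0001F609"]:
        print(f"U+{ord(ch):04X} plane={unicode_plane(ch)} bytes={utf8_length(ch)}")
```

Anything with a plane number above 0 needs four bytes in UTF-8, which is exactly the case the next section is about.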

Here’s where the trouble begins. For years, many database systems, including older versions of MySQL, used something called utf8. Despite its name, this wasn’t true UTF-8. It was a cut-down version that only allowed up to three bytes per character. Today, we call this utf8mb3, where “mb3” signals a maximum of three bytes per character.

The limitation is simple but brutal: utf8mb3 cannot represent any character that requires four bytes. So when your app tries to save an emoji like 😉, the database rejects the value (MySQL typically reports an "Incorrect string value" error), and the application surfaces that as an exception. The emoji is not invalid; the encoding is simply too small to hold it.
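If you want to know in advance whether a piece of text would trip over this limit, you can check for code points above U+FFFF before the text ever reaches the database. This is only a sketch on the application side, not something the database does for you, and the function names are hypothetical.

```python
def find_four_byte_chars(text: str) -> list[tuple[int, str, str]]:
    """List (index, character, code point) for every character outside the BMP.

    These are the characters that need four bytes in UTF-8 and therefore
    cannot be stored in a utf8mb3 column.
    """
    return [
        (i, ch, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ord(ch) > 0xFFFF
    ]


def fits_utf8mb3(text: str) -> bool:
    """Return True if every character needs at most three bytes in UTF-8."""
    return not find_four_byte_chars(text)


if __name__ == "__main__":
    comment = "Looks good to me \U0001F609"
    print(fits_utf8mb3(comment))
    print(find_four_byte_chars(comment))
```

A check like this can help you log or reject problematic input gracefully while the real fix, the migration, is still pending.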

The Solution: Migrating to utf8mb4

The solution is to use a character set that is fully compliant with modern UTF-8: utf8mb4. Simply changing the database character set is not enough. You must also ensure that the tables, columns, and connection strings are configured to use utf8mb4.
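To illustrate the different levels that need attention, the sketch below builds the MySQL statements for the database, each table, and the current connection. The database name, table names, and the default collation utf8mb4_unicode_ci are assumptions for this example; identifiers are not escaped, and the statements are only generated, not executed. Try them on a backup copy first.

```python
def utf8mb4_statements(
    database: str,
    tables: list[str],
    collation: str = "utf8mb4_unicode_ci",  # assumed default, pick what fits your data
) -> list[str]:
    """Build the SQL statements that switch a database and its tables to utf8mb4."""
    statements = [
        # Default for tables created in this database from now on
        f"ALTER DATABASE `{database}` CHARACTER SET utf8mb4 COLLATE {collation};"
    ]
    for table in tables:
        # Converts the table default and its existing text columns
        statements.append(
            f"ALTER TABLE `{table}` CONVERT TO CHARACTER SET utf8mb4 COLLATE {collation};"
        )
    # Must be sent on every connection (or set via the driver's charset option)
    statements.append("SET NAMES utf8mb4;")
    return statements


if __name__ == "__main__":
    # "shop", "comments" and "users" are hypothetical names
    for stmt in utf8mb4_statements("shop", ["comments", "users"]):
        print(stmt)
```

Most database drivers also offer a charset setting in the connection configuration, which has the same effect as sending SET NAMES yourself.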

Steps to Migrate (Dump & Reload Method)

  1. Backup: Create a complete backup of your existing database.
  2. Export: Export the schema and data into a SQL dump file.
  3. Edit: Replace all instances of utf8mb3 (older dumps may write it as utf8) with utf8mb4, including collation names such as utf8mb3_general_ci.
  4. New Database: Create a new, empty database with utf8mb4.
  5. Import: Import the modified dump file into the new database.
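The edit step of this list can be sketched in a few lines of Python. The regular expression below rewrites utf8 and utf8mb3, including collation prefixes such as utf8mb3_general_ci, and leaves utf8mb4 untouched. It is a rough sketch under one loud assumption: it treats the dump as plain text, so the word "utf8" inside your actual data would be rewritten as well. Review the result before importing it.

```python
import re

# Matches "utf8" or "utf8mb3" as a whole charset name or as a collation prefix,
# but not "utf8mb4" (the lookahead rejects a following letter or digit).
_LEGACY_CHARSET = re.compile(r"\butf8(?:mb3)?(?![0-9a-z])", re.IGNORECASE)


def convert_dump_charset(sql: str) -> str:
    """Replace legacy utf8/utf8mb3 names in a SQL dump with utf8mb4."""
    return _LEGACY_CHARSET.sub("utf8mb4", sql)


if __name__ == "__main__":
    line = "ENGINE=InnoDB DEFAULT CHARSET=utf8mb3 COLLATE=utf8mb3_general_ci;"
    print(convert_dump_charset(line))
```

For a real dump you would read the file, pass its content through convert_dump_charset, and write the result to a new file, so that the original export stays untouched as a fallback.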

Conclusion

Crazy, right? The case of the broken emoji 😉 highlights a fascinating technical debt issue. While it may seem like a simple problem, it's also a reminder of how the rapid evolution of digital communication can outpace foundational database technology.

By migrating your database to utf8mb4, you ensure that your application is future-proof and ready to handle any emoji or international character that comes its way.

This matters all the more in a world where we communicate with likes and emojis: your customers will be able to use emojis and special characters freely, without errors or frustration.

Check your own database today: are you still on utf8mb3?